HTML API: Add get full comment text method #7342

@sirreal sirreal commented Sep 12, 2024

Trac ticket: Core-62036

In some circumstances, the full contents of a comment node, as a browser would see them, cannot be determined without inspecting the internal state of the HTML API classes.

See #7331 (comment) for an example and details.

In short, HTML parsing enters the "bogus comment state", which may be represented several ways in the HTML Processor. For comments like <!c> and <?c>, there is no way to know whether the comment is equivalent to <!--c--> or <!--?c-->. Both appear to be <!--c--> when get_modifiable_text() is used to inspect the comment text, although in fact <!c> becomes <!--c--> while <?c> becomes <!--?c-->.

Additionally, the new method makes clear what comment text the "lookalike" comment types have, so CDATA and Processing Instruction lookalike comments can also be queried simply:

  • <![CDATA[foo]]> becomes <!--[CDATA[foo]]-->
  • <?pi bar?> becomes <!--?pi bar?-->
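The conversions described above can be modeled with a short sketch. This is not the WordPress implementation: it is a simplified Python model of how the HTML tokenizer assigns data to a bogus comment, and the function name `bogus_comment_text` is hypothetical. It assumes the input is a single complete token in HTML (not SVG or MathML) content that starts with `<!` or `<?` and ends at its first `>`.

```python
def bogus_comment_text(token: str) -> str:
    """Return the comment data a browser would assign to a bogus comment.

    Simplified model (an assumption, not the WordPress code): `token` is one
    complete token that starts with `<!` or `<?` and ends at its first `>`.
    """
    if token.find(">") != len(token) - 1:
        raise ValueError("expected one token closed by its first '>'")

    if token.startswith("<?"):
        # Only the '<' is syntax here; the '?' is kept as comment data.
        start = 1
    elif token.startswith("<!"):
        rest = token[2:]
        if rest.startswith("--") or rest[:7].upper() == "DOCTYPE":
            raise ValueError("a normal comment or DOCTYPE, not a bogus comment")
        # Both '<' and '!' are syntax; the data starts right after them.
        # In HTML content this also covers the `[CDATA[` lookalike.
        start = 2
    else:
        raise ValueError("token must start with '<!' or '<?'")

    # The tokenizer replaces NULL bytes with U+FFFD inside bogus comments.
    return token[start:-1].replace("\x00", "\ufffd")


if __name__ == "__main__":
    for source in ("<!c>", "<?c>", "<![CDATA[foo]]>", "<?pi bar?>"):
        print(f"{source} becomes <!--{bogus_comment_text(source)}-->")
```

Run as a script, this prints the same four equivalences listed in this description. The point of the sketch is that the opening delimiter decides whether the `?` survives into the comment text, which is the information that cannot be recovered from the inner text alone.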

This is useful for the html5lib tests or anyone seeking to understand the comment text content as a browser would.

This should be helpful for normalization: #7331

This fixes 3 tests in the html5lib test suite that were known failures because the comment contents described above could not be determined.


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

@sirreal sirreal marked this pull request as ready for review September 12, 2024 17:23

github-actions bot commented Sep 12, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props jonsurrell, dmsnell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.


@dmsnell dmsnell left a comment

This is great, and resolves a problem. I wanted to merge this with get_modifiable_text() (but not set_modifiable_text()), except you are right, and it's incompatible. This will be more appropriate when calling get_inner_text() or get_text_content() when that eventually appears, as there's no notion of "modifiable" that it conveys.

pento pushed a commit that referenced this pull request Sep 20, 2024
Previously, there were a few cases where the modifiable text read from an HTML comment differed slightly from the parsed value of its inner text in a browser. This is due to the specific way that invalid HTML syntax tokens become "bogus comments."

This patch introduces a new method to the Tag Processor to allow differentiating these specific cases, such as when copying or serializing HTML from one source to another. Similar code has already been in use in the html5lib tests, and this patch simplifies the test runner, evidencing the fact that this method was already needed.

Developed in #7342
Discussed in https://core.trac.wordpress.org/ticket/62036

Props dmsnell, jonsurrell.
See #62036.


git-svn-id: https://develop.svn.wordpress.org/trunk@59075 602fd350-edb4-49c9-b593-d223f7449a82
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request Sep 20, 2024
Built from https://develop.svn.wordpress.org/trunk@59075


git-svn-id: http://core.svn.wordpress.org/trunk@58471 1a063a9b-81f0-0310-95a4-ce76da25c4cd
github-actions bot pushed a commit to gilzow/wordpress-performance that referenced this pull request Sep 20, 2024
Built from https://develop.svn.wordpress.org/trunk@59075


git-svn-id: https://core.svn.wordpress.org/trunk@58471 1a063a9b-81f0-0310-95a4-ce76da25c4cd

dmsnell commented Sep 20, 2024

Merged in [59075]
675a1aa

@dmsnell dmsnell closed this Sep 20, 2024